Abstract
Webpage segmentation is a basic building block for a wide range of webpage analysis methods. The rapid development of Web technologies has produced more dynamic and complex webpages, which bring new challenges to this area. To improve the performance of webpage segmentation, we propose a two-stage segmentation method that combines visual, logical, and semantic features of the content on a webpage. In the first stage, we devise a new model that measures the similarity of elements on a webpage based on both visual layout and logical organization. In the second stage, we propose a novel block regrouping method that uses semantic statistics and visual positions. This two-stage method can effectively segment complicated and dynamic webpages. We verify the performance and accuracy of the method by comparing it with two existing webpage segmentation methods. The experimental results show that the proposed method significantly outperforms the existing state of the art in precision, recall, and accuracy.
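The abstract does not specify the similarity model, so the following is only an illustrative sketch of the first-stage idea: scoring pairs of elements by a mix of visual proximity and DOM-path overlap, then grouping elements whose score passes a threshold. The `Element` type, the two scoring formulas, the weight `alpha`, the `scale` constant, and the `threshold` are all assumptions made for illustration and are not taken from the paper. The second-stage semantic regrouping is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Element:
    """A rendered page element (hypothetical representation, not the paper's)."""
    name: str
    box: Tuple[float, float, float, float]  # x, y, width, height in pixels
    dom_path: Tuple[str, ...]               # steps from the document root to the element


def visual_similarity(a: Element, b: Element, scale: float = 500.0) -> float:
    """Map the distance between box centres to (0, 1]; closer boxes score higher."""
    ax, ay = a.box[0] + a.box[2] / 2, a.box[1] + a.box[3] / 2
    bx, by = b.box[0] + b.box[2] / 2, b.box[1] + b.box[3] / 2
    dist = ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
    return 1.0 / (1.0 + dist / scale)


def logical_similarity(a: Element, b: Element) -> float:
    """Length of the shared DOM-path prefix divided by the longer path length."""
    common = 0
    for step_a, step_b in zip(a.dom_path, b.dom_path):
        if step_a != step_b:
            break
        common += 1
    return common / max(len(a.dom_path), len(b.dom_path), 1)


def similarity(a: Element, b: Element, alpha: float = 0.5) -> float:
    """Weighted mix of the two scores; alpha is an assumed tuning parameter."""
    return alpha * visual_similarity(a, b) + (1 - alpha) * logical_similarity(a, b)


def group(elements: List[Element], threshold: float = 0.7) -> List[List[str]]:
    """Single-link grouping: merge any two elements whose similarity >= threshold."""
    parent = list(range(len(elements)))

    def find(i: int) -> int:
        # Follow parent links to the set representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(elements)):
        for j in range(i + 1, len(elements)):
            if similarity(elements[i], elements[j]) >= threshold:
                parent[find(i)] = find(j)

    blocks: dict = {}
    for i, element in enumerate(elements):
        blocks.setdefault(find(i), []).append(element.name)
    return sorted(blocks.values())
```

With these assumed settings, two adjacent links inside the same `nav` element would fall into one block, while a footer paragraph far down the page would form its own block. A real system would obtain the boxes and DOM paths from a rendering engine rather than hand-written values.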
Index Terms
- Constructing Novel Block Layouts for Webpage Analysis